We investigate a dataset on white wine and the influence of its chemical properties on the ratings given by wine experts. The full dataset contains 4898 samples with 11 attributes and 1 output value (the median of at least three evaluations, from 0 - bad to 10 - excellent quality). The dataset comes from P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
There are 11 physicochemical input variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.
We add a qualitative variable (wine category) to the dataset, which describes how sweet the wine is (based on the residual sugar). If the residual sugar is below 4 g / dm3 then the wine is dry. Between 4 and 12 g / dm3 the wine is medium dry and between 12 and 45 g / dm3 the wine is medium sweet. Above 45 g / dm3 residual sugar the wine is sweet. The output variable quality is a score between 0 and 10 and is based on sensory data.
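The thresholds above map naturally onto base R's `cut()`. The following is a minimal sketch, not necessarily the code used for this report: the helper name `sugar_to_category` is hypothetical, and the treatment of values lying exactly on a threshold (here: assigned to the sweeter class) is an assumption. The input values are the first ten residual sugar readings shown in the dataset preview.

```r
# Sweetness thresholds in g/dm^3, taken from the text above.
# right = FALSE gives the intervals [0, 4), [4, 12), [12, 45), [45, Inf);
# how boundary values are handled in the original analysis is an assumption.
sugar_to_category <- function(residual_sugar) {
  cut(residual_sugar,
      breaks = c(0, 4, 12, 45, Inf),
      labels = c("dry", "medium dry", "medium sweet", "sweet"),
      right = FALSE)
}

# First ten residual sugar values from the dataset preview
sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
category <- sugar_to_category(sugar)
print(data.frame(residual.sugar = sugar, category = category))
```

For these ten values the resulting factor codes agree with the `category` column shown in the dataset preview (3 1 2 2 2 2 2 3 1 1).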
Wine quality appears approximately normally distributed, and 93% of all wines have a quality rating between 5 and 7 points.
There are some very large chloride values, more than four times the median. It will be interesting to investigate the influence of chlorides on the wine quality rating.
There are 4898 observations, 11 input variables (all numerical), one categorical variable and one output variable:
## 'data.frame': 4898 obs. of 13 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ category : Factor w/ 4 levels "dry","medium dry",..: 3 1 2 2 2 2 2 3 1 1 ...
The wine categories “dry”, “medium dry” and “medium sweet” have different distributions of the output variable quality. Only a single observation in the dataset falls into the wine category “sweet”.
##
## dry medium dry medium sweet sweet
## 3 9 8 3 0
## 4 93 58 12 0
## 5 513 623 321 0
## 6 924 896 377 1
## 7 458 318 104 0
## 8 78 73 24 0
## 9 3 2 0 0
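The counts in the table above are enough to check two statements in this report: the share of wines rated between 5 and 7 points and the median quality per wine category. A minimal sketch in base R, with the counts typed in by hand from the table (the variable names are illustrative):

```r
# Counts copied from the table above (rows: quality 3..9, columns: category)
counts <- matrix(
  c(  9,   8,   3, 0,
     93,  58,  12, 0,
    513, 623, 321, 0,
    924, 896, 377, 1,
    458, 318, 104, 0,
     78,  73,  24, 0,
      3,   2,   0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(quality  = as.character(3:9),
                  category = c("dry", "medium dry", "medium sweet", "sweet"))
)

total <- sum(counts)                                      # all observations
share_5_to_7 <- sum(counts[c("5", "6", "7"), ]) / total   # ratings 5, 6 and 7

# Median quality per category: expand each column back into individual ratings
median_quality <- apply(counts, 2, function(n) {
  median(rep(as.integer(rownames(counts)), times = n))
})

print(total)
print(round(100 * share_5_to_7, 1))
print(median_quality)
```

With these counts the total is 4898, about 92.6% of the ratings lie between 5 and 7, and the median quality is 6 in every category.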
We want to predict the wine quality from the input features, so the main question is which features influence it most. Based on intuition, the features volatile acidity, chlorides, total sulfur dioxide and sulphates are important, because the wine quality will be bad if their levels are too high.
The features citric acid and residual sugar are expected to improve the wine quality. The volume percent of alcohol is not expected to influence the quality of the wine.
Besides the qualitative variable (wine category, see Introduction), 5 more variables were added to the dataset:
Several features have a skewed distribution or some outliers on the right side. Hence, to make the analysis more robust, only values below the 99th percentile are considered for the following variables:
The cleaning removes 339 observations from the dataset, leaving 4559 observations.
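The trimming step can be sketched as follows. This is an illustration under stated assumptions, not the report's actual code: the helper `trim_upper` is hypothetical, each cutoff is computed on the full data before any row is dropped, and the toy data frame is made up so that the result is easy to verify by hand.

```r
# Keep only rows whose value is strictly below the given upper quantile
# for every variable listed in `vars`.
# ASSUMPTION: all cutoffs are computed on the untrimmed data.
trim_upper <- function(data, vars, prob = 0.99) {
  keep <- rep(TRUE, nrow(data))
  for (v in vars) {
    cutoff <- quantile(data[[v]], probs = prob, names = FALSE)
    keep <- keep & data[[v]] < cutoff
  }
  data[keep, , drop = FALSE]
}

# Toy data (made up): 200 rows, two variables with their largest values
# at opposite ends of the table
toy <- data.frame(chlorides = (1:200) / 100,
                  sulphates = (200:1) / 100)

trimmed_one  <- trim_upper(toy, "chlorides")
trimmed_both <- trim_upper(toy, c("chlorides", "sulphates"))
print(c(nrow(toy), nrow(trimmed_one), nrow(trimmed_both)))
```

Because each variable removes roughly its own top 1% of rows, trimming several variables at once removes more than 1% of the observations in total.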
Surprisingly, alcohol level is positively correlated with wine quality. Chlorides, total sulfur dioxide and volatile-acid-proportion have a negative correlation with wine quality.
Most wines have a quality rating between 5 and 7 points. The median alcohol level for wines with a rating of 5 points is 9.6%, for wines with a rating of 6 points 10.5% and for wines with a rating of 7 points 11.4%.
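Group-wise medians like these can be computed with `tapply()`. The data below are toy values made up for illustration; they are not taken from the wine dataset and do not reproduce the medians quoted above.

```r
# Toy data, made up for illustration (not taken from the wine dataset)
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.0, 9.4, 10.2, 9.8, 10.6, 11.0, 10.9, 11.5, 12.3)
)

# Median alcohol level for each quality score
median_alcohol <- tapply(toy$alcohol, toy$quality, median)
print(median_alcohol)
```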
Wine quality and alcohol level are positively correlated. If alcohol level and another input feature are negatively correlated, that feature can have an indirect negative influence on wine quality (possibly not a linear one).
The median wine quality is the same for all wine categories.
Surprisingly, alcohol shows the strongest linear correlation with quality. On the other hand, there is only a weak linear relationship between quality and volatile acidity, chlorides and total sulfur dioxide. Additionally, there is a positive correlation between quality and free sulfur proportion as well as between quality and the citric-volatile-ratio, both of which are variables introduced during the analysis.
Alcohol has a stronger negative linear correlation with residual sugar (which is plausible, because sugar is converted into alcohol during fermentation), chlorides, free sulfur dioxide, total sulfur dioxide and density (again plausible, because alcohol has a lower density than water, so a higher alcohol level means a lower water level). Hence, a higher alcohol level comes with lower levels of chlorides and sulfur dioxide, and this combination improves the quality.
The strongest positive correlation is between volatile acidity and the volatile acid proportion with r = 0.935. The strongest negative correlation is between alcohol and density with r = -0.813. Both correlations are plausible.
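The r values quoted here are Pearson correlation coefficients. As a minimal sketch of what that means, the coefficient can be written out by hand and compared with R's built-in `cor()`. The helper `pearson` is illustrative. The input is only the first five rows of the dataset preview, so the result is not the r = -0.813 of the full data; it merely shows the same negative sign.

```r
# First five rows of the dataset preview (alcohol and density)
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9)
density <- c(1.001, 0.994, 0.995, 0.996, 0.996)

# Pearson correlation written out: sum of products of the deviations
# from the mean, divided by the root of the product of the sums of squares
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson(alcohol, density)
r_builtin <- cor(alcohol, density)   # default method is "pearson"
print(c(manual = r_manual, builtin = r_builtin))
```

For a whole data frame of numeric columns, `cor(df)` returns the full correlation matrix in one call.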
Wines with a lower quality rating have a higher chloride level and a lower alcohol level.
Wines with a lower quality rating have a higher free sulfur dioxide level and a lower alcohol level.
Alcohol level and free sulfur dioxide level have a negative correlation. When the alcohol level is low, the chloride level is high at all levels of free sulfur dioxide.
There is an area of high wine quality for a total sulfur dioxide level of 80-130 mg/dm3 and a chloride level of 0.02-0.04 g/dm3.
The area of high wine quality is the same for all wine categories.
There is an area of high wine quality for a total sulfur dioxide level of 80-130 mg/dm3 and a free sulfur dioxide level of 30-50 mg/dm3.
There is an area of high wine quality for a volatile acidity level of 0.15-0.3 g / dm3 and a Citric-Volatile-Ratio of 1-2.
The strong relation between wine quality and alcohol level, which is not intuitive, is based on more complex relationships. Chlorides, free sulfur dioxide and total sulfur dioxide are all negatively correlated with alcohol. Hence, chlorides, free sulfur dioxide and total sulfur dioxide have a negative influence on wine quality, although not a linear one. For example, if the level of chlorides is too high, the wine quality goes down no matter what level of free sulfur dioxide is present. Conversely, a low level of free sulfur dioxide does not guarantee a high wine quality, because the chloride level may still be too high. Wines with a high alcohol level have lower levels of chlorides, free sulfur dioxide and total sulfur dioxide and consequently a higher wine quality.
Among wines of high quality, the citric-volatile-ratio (ratio of citric acid to volatile acidity) decreases as volatile acidity increases. Apparently, a good-quality wine needs less citric acid relative to volatile acidity as the level of volatile acidity increases.
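The derived ratio can be sketched in a few lines of base R using the first five rows of the dataset preview. Only the citric-volatile-ratio is defined in the text; the column name `citric.volatile.ratio` is hypothetical, and the definition of `free.sulfur.ratio` as free divided by total sulfur dioxide is an assumption based on its name, not something the report states.

```r
# First five rows of the dataset preview (only the columns needed here)
wine_head <- data.frame(
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186)
)

# Defined in the text: ratio of citric acid to volatile acidity.
# The column name is hypothetical.
wine_head$citric.volatile.ratio <-
  wine_head$citric.acid / wine_head$volatile.acidity

# ASSUMPTION: free sulfur dioxide as a share of total sulfur dioxide.
wine_head$free.sulfur.ratio <-
  wine_head$free.sulfur.dioxide / wine_head$total.sulfur.dioxide

# Flag rows inside the region described earlier
# (volatile acidity 0.15-0.3 g/dm^3, citric-volatile-ratio 1-2)
in_area <- with(wine_head,
                volatile.acidity >= 0.15 & volatile.acidity <= 0.3 &
                  citric.volatile.ratio >= 1 & citric.volatile.ratio <= 2)

print(round(wine_head$citric.volatile.ratio, 2))
print(in_area)
```

The flag only marks whether a wine lies inside the region; it says nothing about the rating of that individual wine.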
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + chlorides, data = wine)
## m3: lm(formula = quality ~ alcohol + chlorides + free.sulfur.ratio,
## data = wine)
## m4: lm(formula = quality ~ alcohol + chlorides + free.sulfur.ratio +
## volatile.ratio, data = wine)
## m5: lm(formula = quality ~ alcohol + chlorides + free.sulfur.ratio +
## volatile.ratio + citric.sugar.ratio, data = wine)
## m6: lm(formula = quality ~ alcohol + chlorides + total.sulfur.dioxide,
## data = wine)
##
## ===========================================================================================
## m1 m2 m3 m4 m5 m6
## -------------------------------------------------------------------------------------------
## (Intercept) 2.561*** 2.978*** 2.608*** 2.822*** 2.682*** 2.655***
## (0.101) (0.133) (0.134) (0.134) (0.133) (0.160)
## alcohol 0.317*** 0.294*** 0.288*** 0.304*** 0.334*** 0.311***
## (0.010) (0.011) (0.010) (0.010) (0.011) (0.011)
## chlorides -4.137*** -3.592*** -3.027*** -2.085* -4.337***
## (0.860) (0.846) (0.836) (0.835) (0.860)
## free.sulfur.ratio 1.623*** 1.420*** 1.295***
## (0.127) (0.127) (0.126)
## volatile.ratio -8.608*** -10.527***
## (0.782) (0.802)
## citric.sugar.ratio -0.995***
## (0.109)
## total.sulfur.dioxide 0.001***
## (0.000)
## -------------------------------------------------------------------------------------------
## R-squared 0.2 0.2 0.2 0.2 0.3 0.2
## adj. R-squared 0.2 0.2 0.2 0.2 0.3 0.2
## sigma 0.8 0.8 0.8 0.8 0.8 0.8
## F 1108.1 568.4 446.6 374.2 321.4 384.3
## p 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -5369.6 -5358.1 -5278.0 -5218.1 -5176.7 -5351.5
## Deviance 2814.7 2800.4 2703.8 2633.6 2586.2 2792.4
## AIC 10745.3 10724.2 10566.1 10448.2 10367.3 10713.0
## BIC 10764.6 10749.9 10598.2 10486.7 10412.3 10745.1
## N 4559 4559 4559 4559 4559 4559
## ===========================================================================================
The best model, m5, explains the wine quality with the features alcohol level, chlorides, free sulfur ratio, volatile ratio and citric sugar ratio. All coefficients are statistically significantly different from zero. The basic wine quality (intercept) is 2.7 and is increased by a higher alcohol level and a higher free sulfur ratio. The quality is decreased by a higher chloride level, a higher volatile ratio and a higher citric-sugar-ratio. The coefficients are consistent with the correlation matrix. However, the linear model is weak in modeling complex relationships between alcohol level, chlorides and total sulfur dioxide as explained above. In model m6, a higher total sulfur dioxide level very slightly increases the wine quality, whereas the correlation matrix showed a negative correlation between quality and total sulfur dioxide.
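The model table rounds R-squared to one digit, which makes the models hard to tell apart. For a linear model with an intercept, R-squared can be recovered with more digits from the F statistic, the number of predictors k and the sample size N, and the AIC can be rechecked from the log-likelihood. The sketch below uses only numbers copied from the table; the variable names are illustrative, and the parameter count k + 2 (slopes, intercept, residual variance) is the convention R's `AIC()` uses for `lm` fits.

```r
# Values copied from the model table (m1 ... m6)
models    <- c("m1", "m2", "m3", "m4", "m5", "m6")
f_stat    <- setNames(c(1108.1, 568.4, 446.6, 374.2, 321.4, 384.3), models)
n_pred    <- setNames(c(1, 2, 3, 4, 5, 3), models)   # predictors per model
log_lik   <- setNames(c(-5369.6, -5358.1, -5278.0, -5218.1, -5176.7, -5351.5), models)
aic_table <- setNames(c(10745.3, 10724.2, 10566.1, 10448.2, 10367.3, 10713.0), models)
n <- 4559

# F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), solved for R^2
r_squared <- (n_pred * f_stat) / (n_pred * f_stat + (n - n_pred - 1))

# AIC = -2 * logLik + 2 * (number of estimated parameters)
aic <- -2 * log_lik + 2 * (n_pred + 2)

print(round(r_squared, 3))
print(round(aic - aic_table, 1))
```

Under these formulas R-squared comes out at roughly 0.20 for m1 and 0.26 for m5, and the recomputed AIC values agree with the table up to rounding of the log-likelihood.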
For the final plots, two interesting properties of wine that influence the quality are shown. The first two plots show the distribution of chloride level and the influence of chloride level on the wine quality. The third plot reveals a relationship between citric-volatile-ratio (ratio of citric acid to volatile acidity) and volatile acidity and shows an area where many wines have a high quality rating.
This plot shows the distribution of chlorides in all wines, with the median marked by a vertical line. It appears to be a unimodal distribution. However, there are several wines which contain a considerably higher amount of salt, even three times more than the median. 90% of the chloride levels of all wines are between 0.027 and 0.063 g / dm3.
The plot shows the chloride level and total sulfur dioxide of the wines, colored by wine quality. There is a slight positive correlation between chlorides and total sulfur dioxide. The quality of a wine is low if the chloride level or the total sulfur dioxide is too high. On the other hand, there is an area of chloride level and total sulfur dioxide where the wine quality is higher on average (red area in the plot). This seems to be a good combination for a good taste. This area of high wine quality has a total sulfur dioxide level of 80-130 mg/dm3 and a chloride level of 0.02-0.04 g/dm3. However, if both levels are lower, the wine quality is low, too.
The plot shows the relationship between the citric-volatile-ratio (ratio of citric acid to volatile acidity) and volatile acidity, colored by wine quality. There is no linear relationship between the two features; however, a higher level of volatile acidity corresponds to a lower citric-volatile-ratio. Wine of high quality has a medium citric-volatile-ratio, and the good ratio decreases with increasing volatile acidity. There is an area of high wine quality for a volatile acidity level of 0.15-0.3 g/dm3 (approximately around the median of volatile acidity, 0.26 g/dm3) and a citric-volatile-ratio of 1-2 (also approximately around the median of the citric-volatile-ratio, 1.27).
The wine data set contains information about roughly 4500 different white wines and the influence of chemical properties on the ratings given by wine experts. At first, I explored each input variable on its own and created five new variables (ratios) to answer questions about relative amounts. After exploring each variable, I assumed that the variables volatile acidity, chlorides, total sulfur dioxide and sulphates are important for wine quality, because the wine quality will decrease if their levels are too high. Surprisingly, however, there was no high correlation between quality and these four variables. There was only a high correlation between alcohol level and wine quality, a relationship that was unexpected.
The strong relation between wine quality and alcohol level is based on more complex relationships. Chlorides, free sulfur dioxide and total sulfur dioxide are all negatively correlated with alcohol. Hence, chlorides, free sulfur dioxide and total sulfur dioxide have a negative influence on wine quality, although not a linear one. For example, if the level of chlorides is too high, the wine quality goes down no matter what level of free sulfur dioxide is present. Conversely, a low level of free sulfur dioxide does not guarantee a high wine quality, because the chloride level may still be too high.
I then tried to fit a linear model to predict the wine quality based on the input variables alcohol level, chlorides, free sulfur ratio, volatile ratio and citric sugar ratio. The coefficients of the linear model are consistent with the correlation matrix. However, the linear model is weak in modeling complex relationships between alcohol level, chlorides and total sulfur dioxide as explained above. To investigate this data further, I would try to find a better linear model which can represent the complex relationships between the different input variables, such that the alcohol level is no longer needed to predict the wine quality.